Baltimore City Employee Salaries

If you're viewing this on your phone, I'd recommend enabling desktop mode. In Google Chrome, tap the three dots at the top right of your screen, then tap the checkbox next to Desktop Site. You can zoom in and out of any section with your fingers. My apologies for the hassle and thanks for reading!

Here, I go through an exploratory analysis of Baltimore city employee salaries. I obtained all data through the City of Baltimore's Open Baltimore Data platform (OBD).

I created a small Python function that batches calls to the OBD REST API. I connected the GitHub repo where I store the Python code to a Google Cloud Source Repository. Each time I push a change to the master branch, GitHub notifies Google Cloud, and Cloud Build deploys a new version of the function to Cloud Functions from the updated source repo. Since OBD appears to update the data only about once per year, I run the function manually. However, Cloud Scheduler could run the function at any interval I want using a cron schedule.
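
The batching itself is simple offset paging. Here's a minimal sketch of the idea, assuming an endpoint that accepts offset- and limit-style parameters (the OBD API's actual parameter names differ); the page-fetching function is passed in so the loop can be tested without any network access:

```python
from typing import Callable, Dict, List

def fetch_all(fetch_page: Callable[[int, int], List[Dict]],
              page_size: int = 1000) -> List[Dict]:
    """Call fetch_page(offset, limit) repeatedly until a short page signals the end."""
    records: List[Dict] = []
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        records.extend(page)
        if len(page) < page_size:  # a short (or empty) page means we're done
            break
        offset += page_size
    return records
```

In the deployed function, `fetch_page` wraps an HTTP GET against the OBD endpoint and returns the parsed records for that offset.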

Once the function finishes querying OBD, it stores the data as a CSV in Cloud Storage. From there, I connect the CSV to a BigQuery table. I considered updating the function to write directly to the BigQuery table, but I want to see how regularly OBD updates this data set going forward and whether they change the schema; they've changed the schema for this data set at least once that I've seen. I'll add some other details about this data set and a flow chart in the technical notes below.

Technical Notes and Disclosures

  • OBD only has one name field in this release, which made it difficult to group records by person for analysis. In a previous release, I combined the firstName, lastName, and middleInitial fields with AgencyID to create an employeeSlug that I used to track workers across organizations. With just the Name and AgencyID fields to work with in this release, tracking individual workers is difficult, though not impossible. It'll definitely be the work of a future analysis. This initial analysis focuses on agency-level salary measures.
  • I excluded salary records where the agency had fewer than 500 salary records total or the record omitted the HireDate field. I wanted to keep the initial analysis as robust as possible, so I elected to clean the data in this way.
  • I also excluded records that were missing data in the annualSalary or hireDate fields since those fields were essential to my measures for this initial analysis.
  • I regrouped the raw AgencyName values because the data set duplicated some agency names, made distinctions inside an agency I didn't want to keep, and so on. For example, OBD broke "Rec & Parks" out into "Rec & Parks - Parks", "Rec & Parks - part-time", "Rec & Parks - Admin", etc.
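
For reference, the employeeSlug I mentioned above could be built from the previous release's name fields roughly like this; the function name and exact normalization here are illustrative, not my original implementation:

```python
import re

def make_employee_slug(first: str, last: str,
                       middle_initial: str, agency_id: str) -> str:
    """Combine name parts and agency ID into a stable, lowercase per-person key."""
    parts = (first, last, middle_initial, agency_id)
    # lowercase each part and strip anything that isn't a letter or digit
    cleaned = (re.sub(r"[^a-z0-9]", "", p.strip().lower()) for p in parts)
    return "-".join(cleaned)
```

The same person should then hash to the same slug across fiscal years, as long as the source fields are spelled consistently.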

City Salaries Data Set API Request Flow

First Things First

Gotta get the data in! As I noted in the beginning, I stored this data in a BigQuery table, so the first step is to query the table and bring in the salary and employee name data I want. After I created a Google Cloud account and project, I installed the Google Cloud client libraries for Python. Lots of trial and error, but overall a pretty straightforward process. Unfortunately, I don't have a very comprehensive set of links to show y'all how to get started on Google Cloud if you wanted to recreate this setup, but I'll list the basic steps below:

  1. Create a Google Cloud account (pretty easy if you already have a gmail, also easy if you don't)
  2. Create a Google Cloud project
  3. Create a BigQuery table
  4. Install the Cloud Client Libraries for Python
  5. Install the gcloud CLI
  6. Set up Application Default Credentials
    • I do have a link for this! It's what allows me to supply just a project_id and nothing else to connect to my gcloud instances. Highly recommend; it was very easy.

Once all that is done, you can run some version of the code below and bring your data in for analysis.

In [1]:
# import necessary libraries
from google.cloud import bigquery
import pandas as pd
import numpy as np
import datetime as dt
from dateutil import relativedelta as rd
import plotly.express as px

# build client to connect to open-baltimore project
project_id = "open-baltimore-data"
client = bigquery.Client(project=project_id)
In [2]:
# import employee name info
employee_salaries_query = client.query(
    """
    SELECT 
        info.employeeSlug,
        main.objectId,
        TRIM(REGEXP_REPLACE(main.agencyName, r'\(.*?\)', '')) as agencyName,
        main.agencyId,
        main.annualSalary,
        main.fiscalYear,
        main.hireDate
    FROM city_employee_salaries.main as main
    LEFT JOIN city_employee_salaries.employee_info as info
    ON main.ObjectId = info.ObjectId
    """
)
employee_salaries = employee_salaries_query.result().to_dataframe()

Cleaning, Cleaning, Cleaning

Here, I created my own dictionary for translating existing city agency names to my own. Then I used map to populate a cleanAgencyName column with the new names. map loops over each record in the existing agencyId field, checks my dictionary for a match, and returns the matched value in the new column (records without a match come back as NaN). It's fast and the syntax is very simple. Lastly, I converted the hireDate field to datetime format so I can calculate tenure.

In [3]:
# clean fields for analysis
agency_dict =  {
    "A01": "Mayors Office",
    "A02": "City Council",
    "A02": "Mayors OED",
    "A04": "Rec & Parks",
    "A05": "MONSE",
    "A06": "Housing & Community Dev",
    "A08": "M-R Human Services",
    "A09": "Liquor License Board",
    "A10": "Mayors Office of Children & Families",
    "A11": "Office of the Inspector General",
    "A12": "Finance",
    "A14": "Finance",
    "A15": "Comptrollers Office",
    "A16": "Comptrollers Office",
    "A17": "Finance",
    "A18": "Finance",
    "A19": "City Planning",
    "A23": "Finance",
    "A24": "Comptroller - Audits",
    "A26": "M-R Labor Commissioner",
    "A28": "Wage Commissioner",
    "A29": "States Attorneys Office",
    "A30": "Law Department",
    "A31": "Circuit Court",
    "A32": "Finance",
    "A33": "Legislative Reference",
    "A35": "Elections",
    "A37": "Orphans Court",
    "A38": "Sheriffs Office",
    "A39": "311",
    "A40": "BCIT",
    "A41": "DPW - Admin",
    "A44": "M-R Cable & Comms",
    "A46": "Environmental Control Board",
    "A49": "Transportation",
    "A50": "DPW - Waste & Wastewater",
    "A51": "Office of Equity & Civil Rights",
    "A52": "Employee Retirement System",
    "A53": "Finance",
    "A54": "Retirement - Fire & Police",
    "A57": "City Council Services",
    "A64": "Fire Department",
    "A65": "Health Department",
    "A67": "Rec & Parks",
    "A68": "Rec & Parks",
    "A70": "DPW - Solid Waste",
    "A73": "Municipal Zoning & Appeals",
    "A75": "Enoch Pratt Free Library",
    "A83": "Human Resources",
    "A84": "Transportation",
    "A85": "General Services",
    "A86": "War Memorial Commission",
    "A88": "Comptroller - Comms",
    "A90": "Transportation",
    "A91": "Convention Center",
    "A99": "Police Department",
    "A9": "Police Department",
    "B49": "Transportation",
    "B68": "Rec & Parks",
    "B70": "DPW - Solid Waste",
    "BPD": "Police Department",
    "C90": "Transportation - Crossing Guards",
    "P04": "Rec & Parks",
    "P65": "Health Department",
    "P83": "HR Test Monitor",
    "R01": "R01",
    "U01": "U01",
    "SCS": "Special City Services",
    "W02": "Youth Summer Works",
    "W03": "Youth Cust",
    "W07": "Youth Temp Adult",
    "W08": "TANF Cust"
}
employee_salaries_clean = employee_salaries.copy(deep=True)
employee_salaries_clean = employee_salaries_clean[
    employee_salaries_clean["annualSalary"].notnull()
    & (employee_salaries_clean["annualSalary"] != 0)  # parens needed: & binds tighter than !=
]
employee_salaries_clean['cleanAgencyName'] = employee_salaries_clean['agencyId'].map(agency_dict)
employee_salaries_clean = employee_salaries_clean[employee_salaries_clean["hireDate"].notnull()]
employee_salaries_clean['hireDate'] = pd.to_datetime(employee_salaries_clean["hireDate"], unit="ms")
# may need to implement check and ensure tenure across unique employees is the same
employee_salaries_clean['tenure'] = \
    employee_salaries_clean['hireDate'].map(lambda hire_date: rd.relativedelta(dt.datetime.now(), hire_date).years)                                       
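
The check I flagged in the comment above could look something like this sketch: group on the person key and confirm each person maps to at most one distinct tenure value (the column names here are assumed to match mine).

```python
import pandas as pd

def tenure_is_consistent(df: pd.DataFrame,
                         person_col: str = "employeeSlug",
                         tenure_col: str = "tenure") -> bool:
    """True when every person has at most one distinct tenure value."""
    distinct_tenures = df.groupby(person_col)[tenure_col].nunique()
    return bool((distinct_tenures <= 1).all())
```

If this ever returns False, a slug is probably covering two different people (or a hireDate changed between releases), and the affected records would need a closer look.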
In [4]:
# count number of records per city agency
agency_record_count = employee_salaries_clean.groupby(["cleanAgencyName"], as_index=False).agg(
    salaryRecords=pd.NamedAgg(column="cleanAgencyName", aggfunc="count")
    )
agencies_with_500_records = agency_record_count[agency_record_count["salaryRecords"] >= 500]["cleanAgencyName"].values.tolist()   
# calculate median, highest, and lowest salary, growth, distinct salary records, tenure, and average raise per employee and agency
employee_salary_quality = employee_salaries_clean.groupby(["employeeSlug", "cleanAgencyName"], as_index=False).agg(
    medSalary=pd.NamedAgg(column="annualSalary", aggfunc="median"),
    highestSalary=pd.NamedAgg(column="annualSalary", aggfunc="max"),
    lowestSalary=pd.NamedAgg(column="annualSalary", aggfunc="min"),
    growth=pd.NamedAgg(column="annualSalary", aggfunc=lambda salary: max(salary) - min(salary)),
    salaryRecords=pd.NamedAgg(column="annualSalary", aggfunc="nunique"),
    tenure=pd.NamedAgg(column="tenure", aggfunc="first"),
    avgRaise=pd.NamedAgg(column="annualSalary", aggfunc=lambda salary: (max(salary) - min(salary)) / len(salary))
    )
agency_salary_quality = employee_salary_quality.groupby(["cleanAgencyName"], as_index=False).agg(
    medSalary=pd.NamedAgg(column="medSalary", aggfunc="median"),
    medHighestSalary=pd.NamedAgg(column="highestSalary", aggfunc="median"),
    medLowestSalary=pd.NamedAgg(column="lowestSalary", aggfunc="median"),
    medGrowth=pd.NamedAgg(column="growth", aggfunc="median"),
    medTenure=pd.NamedAgg(column="tenure", aggfunc="median"),
    medRaise=pd.NamedAgg(column="avgRaise", aggfunc="median")
    ).query("cleanAgencyName in @agencies_with_500_records")
In [5]:
# add salary box plot
ordered_salary_data = employee_salaries_clean.query("cleanAgencyName in @agencies_with_500_records").loc[:, ["cleanAgencyName", "annualSalary"]] \
    .groupby(["cleanAgencyName"]) \
    .median() \
    .sort_values(by="annualSalary", ascending=False)
fig = px.box(employee_salaries_clean.query("cleanAgencyName in @agencies_with_500_records"), 
             x="cleanAgencyName", 
             y="annualSalary", 
             points=False, 
             category_orders={"cleanAgencyName": ordered_salary_data.index.to_list()}
             )
fig.show()